We consider the problem of performing multiplication of n-bit binary numbers on a chip. Let A denote the chip area, and T the time required to perform multiplication. Using a model of computation which is a realistic approximation to current and anticipated VLSI technology, we show that

$(A/A_0)\,(T/T_0)^{2\alpha} \ge n^{1+\alpha}$

for all $\alpha \in [0,1]$, where $A_0$ and $T_0$ are positive constants which depend on the technology but are independent of n. The exponent $1+\alpha$ is the best possible. A consequence is that binary multiplication is "harder" than binary addition if $AT^{2\alpha}$ is used as a complexity measure for any $\alpha \ge 0$.
In Brent and Kung [79] we give an upper bound on A and T for the problem of addition of n-bit binary numbers. From this result it follows that binary multiplication is "harder" than binary addition if the complexity measure is $AT^{2\alpha}$ (for any $\alpha \ge 0$).

2.  The  Computational  Model 

Our model is intended to be general, but at the same time realistic enough to apply to current VLSI technology such as MOS.

We  assume  the  existence  of  circuit  elements  or  "gates"  which  compute  a 
logical  function  of  two  inputs  in  constant  time  and  occupy  at  least  a 
constant  minimum  area.  Gates  are  connected  by  wires  which  have  constant 
minimum  width  (or,  equivalently,  must  be  separated  by  at  least  some 
minimal  spacing) .  Our  measure  of  the  cost  of  a  design  is  the  area  rather 
than  the  number  of  gates  required.  This  is  an  important  difference 
between  our  model  and  earlier  models  of  Winograd  [67],  Brent  [70]  and 
others. 

Assumptions 

In Sections 3 to 5 we need various subsets of the following assumptions A1 to A8. Comments and justification are given in parentheses following the statement of each assumption. Our notation is summarized in the Appendix.

A1. The computation is performed in a convex planar region R of area A. (Because of heat-dissipation and packing requirements, a two-dimensional planar model is reasonable. If R is not convex we may take its convex hull. R may be a whole chip or part of a chip.)


A2. Wires have minimal width $\lambda > 0$. ($\lambda$ is assumed constant, but in applications of our results it will of course depend on the technology: see Mead and Conway [79].) We also assume R has width at least $\lambda$.

A3. At most $\nu \ge 2$ wires can overlap at any point in R. (Otherwise the area could be reduced by "folding". Since $\nu \ge 2$, the graph of wires (edges) and gates (nodes) need not be planar in a graph-theoretic sense.)

A4. A bit requires minimal time $\tau > 0$ to propagate along a wire. The time for one gate computation and an arbitrary fan-out of the result is included in $\tau$. (Since dimensions are limited by the minimal wire-width $\lambda$ and minimal gate area, a minimal propagation time is reasonable. We do not need to assume that the propagation time increases with the length of the wire. This is fortunate, for with current technology propagation times are limited by wire capacitances rather than the velocity of light. A longer wire will generally have a larger capacitance, and thus require a larger driver to maintain constant propagation time, but the driver area need not exceed a fixed percentage of the wire area, so can be ignored if $\lambda$ is increased slightly; see Mead and Conway [79]. Although it would be reasonable to assume bounded fanout, we do not need this assumption for proving lower bounds. When proving upper bounds, we do assume bounded fanout.)

A5. I/O ports each have area at least $\rho \ge \lambda^2$. (If R is a complete chip, $\rho$ will be large compared to $\lambda^2$. If R is only part of a chip and I/O is to other regions on the chip, $\rho$ could be of order $\lambda^2$.)

A6. Storage for one bit of information takes area at least $\beta > 0$. ($\beta$ will typically be larger than $\lambda^2$, but we do not need to assume this.)

A7. Each input bit is available only once. (There is no free memory outside R. If the same input bit is required at different times, it must be stored within R, taking area at least $\beta$ (see A6).)

A8. The times and locations at which input and output bits are available are fixed and independent of the values of the input bits. (This is necessary if designs are to be modular.)


3. A Lower Bound on $AT^2$

With the model of Section 2, we have the following lower bound on $AT^2$ for any multiplier circuit.


Theorem  3.1 

Under assumptions A1 to A5,

(3.1)  $AT^2 \ge K_1 n^2$,


where 

(3.2) 


Before  proving  Theorem  3.1  we  need  three  Lemmas. 


Lemma  3.1 

For  any  convex  planar  figure  with  area  A,  perimeter  L, 
diameter  D,  and  chord  of  length  C  perpendicular  to  D, 




Let M be the maximum number of elements of S sharing any one output port. (By assumption, $1 \le M < n$.) Let D be a diameter of R, and C a chord perpendicular to D, dividing S into two parts $S_1$ and $S_2$ such that the output ports for elements of $S_1$ lie on one side of C, and those for elements of $S_2$ lie on the other side of C. Since we do not use assumption A6, we can assume that output ports are shrunk to infinitesimal size and that (by an infinitesimal perturbation from the perpendicular to D) C does not intersect any output ports. By "sliding" the intersection of C and D along D, we can arrange that

(3.6)  $\lceil (n-M)/2 \rceil \le |S_i| \le \lfloor (n+M)/2 \rfloor$ for $i = 1, 2$.


Consider multipliers $b = 2^j$ for $j = 0, 1, \ldots, n-1$. Multiplying $a = a_n \ldots a_1$ by $2^j$ gives $p_{i+j} = a_i$ for $i = 1, \ldots, n$. Consider a fixed i with $1 \le i \le n$. For $j = n-i, \ldots, n-1$, we have $n \le i+j < 2n$, so $p_{i+j} \in S$. Let

$S_3(i) = S_1$ if the input port for $a_i$ is on the same side of C as the output ports for elements of $S_1$,
$S_3(i) = S_2$ otherwise.

As j ranges over $n-i, \ldots, n-1$, at most $\lfloor (n+M)/2 \rfloor$ of the $p_{i+j}$ lie in $S_3(i)$ (by (3.6)), so at least $i - \lfloor (n+M)/2 \rfloor$ lie in $S - S_3(i)$. Thus, by definition of $S_3(i)$, a bit of information must cross C for each such $p_{i+j}$. Summing over $i = \lfloor (n+M)/2 \rfloor, \ldots, n$, a total of at least

$0 + 1 + 2 + \cdots + (n - \lfloor (n+M)/2 \rfloor) \ge (n-M)^2/8$

bits must cross C. Since there are only n possible values of j, there is some j for which at least $(n-M)^2/(8n)$ bits must cross C before the product of a and $b = 2^j$ can be transmitted through the output ports.
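The counting step in this argument can be checked numerically. The sketch below is only an illustration: it assumes the quantities are as read from the damaged text, namely that the number of crossing bits is at least the triangular sum $0 + 1 + \cdots + m$ with $m = n - \lfloor (n+M)/2 \rfloor$, and that this is compared with $(n-M)^2/8$. The function names are hypothetical.

```python
from fractions import Fraction


def crossing_bits(n, M):
    """Triangular sum 0 + 1 + ... + m with m = n - floor((n + M) / 2).

    Assumed reading of the proof: this is the total number of bits that
    must cross the chord C, summed over all shifts j.
    """
    m = n - (n + M) // 2
    return m * (m + 1) // 2


def bound_holds(n, M):
    """Check the claimed inequality: the sum is at least (n - M)^2 / 8."""
    return Fraction(crossing_bits(n, M)) >= Fraction((n - M) ** 2, 8)


def per_shift_bound(n, M):
    """Averaging over the n possible shifts gives (n - M)^2 / (8 n)."""
    return Fraction((n - M) ** 2, 8 * n)


# Exhaustive check over a small range of n and M (1 <= M < n).
all_ok = all(bound_holds(n, M) for n in range(2, 200) for M in range(1, n))
```

For example, with n = 10 and M = 2 the sum is 0+1+2+3+4 = 10, which exceeds $(n-M)^2/8 = 8$, and the averaged bound for a single shift is 4/5 of a bit.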




Proof  of  Theorem  3.1 

Let M be as in the proof of Lemma 3.3. If $M = n$, then n output bits share one output port, and assumption A4 gives $T \ge \tau n$. Since there is at least one output port, assumption A5 gives $A \ge \rho \ge \lambda^2$, so

(3.11)  $AT^2 \ge (\lambda\tau n)^2 > K_1 n^2$.

If $M < n$ then Lemma 3.3 is applicable, and gives

$AT \ge K_2 L n$,

so, from (3.4),

$AT \ge 2K_2 (\pi A)^{1/2} n$,

and thus, dividing by $A^{1/2}$ and squaring,

(3.12)  $AT^2 \ge 4\pi K_2^2 n^2$.

The result follows from (3.5), (3.11) and (3.12).


Theorem 3.1 (with a smaller constant for $K_1$) could have been established by a proof parallel to that used by Thompson [79] for the DFT problem. In fact, using his result that relates the area of a graph to its minimum bisection width, one can prove Theorem 3.1 without the convexity assumption in A1. Our proof, above, represents a new approach that incorporates geometric considerations in the lower bound proof.

We feel that the extra convexity assumption we make is not restrictive, since most existing chips do have convex boundaries for packaging reasons. Furthermore, we note that the convexity assumption is needed for establishing results such as Lemma 3.3, which relates AT to the perimeter.


Lemma 4.2

$\mu(N) \ge \dfrac{N^2}{2\ln N}$ for all $N \ge 4$.


Proof

Using a slight modification of Theorem 1 and equation (4.13) of Rosser and Schoenfeld [62], we can show that

$\sigma(N) > \dfrac{N^2}{2\ln N}$ for all $N \ge 348$.

Thus, the result for $N \ge 348$ follows from Lemma 4.1. For $4 \le N \le 347$, the result may be verified by a straightforward computation. □

Lemma 4.3

If $\delta(n)$ is defined by (4.3), then

$\delta(n) \ge 5/6$ for all $n \ge 1$.

Proof

From Lemma 4.2,

(4.4)  $\delta(n) \ge \lceil n - \lg(n \ln 2) \rceil / n$,

and it is easy to verify that the right side of (4.4) is at least 5/6 for all $n \ge 18$. (There is equality for n = 18 and n = 24.) For $1 \le n \le 17$, direct computation shows that $\delta(n) \ge 9/10$. □
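The "easy to verify" step can be illustrated with a short computation. This is only a sketch under the assumption that the right side of (4.4) is $\lceil n - \lg(n \ln 2) \rceil / n$; it checks a finite range of n rather than proving the claim, and the function name `rhs_4_4` and the cutoff 2000 are arbitrary choices made here.

```python
import math
from fractions import Fraction


def rhs_4_4(n):
    """Right side of (4.4), assumed to be ceil(n - lg(n ln 2)) / n."""
    return Fraction(math.ceil(n - math.log2(n * math.log(2))), n)


# Smallest value of the right side over 18 <= n <= 2000.
worst = min(rhs_4_4(n) for n in range(18, 2001))

# Values of n in the same range where the bound 5/6 is attained exactly.
ties = [n for n in range(18, 2001) if rhs_4_4(n) == Fraction(5, 6)]
```

Exact rational arithmetic is used for the final ratio so that the equality cases are detected without floating-point comparison; only the logarithm itself is computed in floating point.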


Table 4.1 gives $\mu(2^n)$, $\mu(2^n)/\tilde\mu(2^n)$, and $\delta(n)$ for $n = 1, 2, \ldots, 17$, where

(4.5)  $\tilde\mu(N) = N^2/(0.71 + \lg\lg N)$

is an empirical approximation to $\mu(N)$. For $5 \le n \le 17$, the approximation error is less than 1 percent. If this remained true for $n > 17$, it would follow that $\delta(n) \ge 9/10$, and the constant 5/6 in Lemma 4.3 and Theorem 4.1 could be increased. On the basis of the empirical evidence, we make the following conjectures.


Conjecture 4.1

$\delta(n) \ge 9/10$ for all $n \ge 1$.


Conjecture 4.2

$\lim_{N \to \infty} \dfrac{\mu(N) \lg\lg N}{N^2} = 1$.


  n          mu(2^n)    mu(2^n)/mu~(2^n)    delta(n)

  1                2        0.355000          1
  2                7        0.748125          1
  3               26        0.932329          1
  4               90        0.952734          1
  5              340        1.006695          1
  6            1,238        0.995890          1
  7            4,647        0.997629          1
  8           17,578        0.995092          1
  9           67,592        1.000412          1
 10          259,768        0.998846          9/10
 11        1,004,348        0.998392          10/11
 12        3,902,357        0.999002          11/12
 13       15,202,050        0.999089          12/13
 14       59,410,557        0.999788          13/14
 15      232,483,840        0.999637          14/15
 16      911,689,012        0.999788          15/16
 17    3,581,049,040        1.000005          16/17

Table 4.1  $\mu(2^n)$ and related functions for n = 1(1)17
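The first rows of Table 4.1 can be reproduced by brute force. The sketch below assumes, following the Appendix and the proof of Theorem 4.1, that $\mu(N)$ counts the distinct products ij with $0 \le i, j < N$, that the approximation is $N^2/(0.71 + \lg\lg N)$, and that $\delta(n) = \lceil \lg \mu(2^n) + 1 - n \rceil / n$. The function names are chosen here for illustration, and the enumeration is only feasible for small n.

```python
import math
from fractions import Fraction


def mu(N):
    """Number of distinct products i*j with 0 <= i, j < N (brute force)."""
    # By symmetry it is enough to take j >= i.
    return len({i * j for i in range(N) for j in range(i, N)})


def mu_tilde(N):
    """Empirical approximation N^2 / (0.71 + lg lg N), assumed form of (4.5)."""
    return N * N / (0.71 + math.log2(math.log2(N)))


def delta(n):
    """ceil(lg mu(2^n) + 1 - n) / n, assumed form of (4.3)."""
    # For a positive integer m, ceil(lg m) equals (m - 1).bit_length().
    ceil_lg = (mu(2 ** n) - 1).bit_length()
    return Fraction(ceil_lg + 1 - n, n)


# Columns n, mu(2^n), ratio, delta(n) for the first five rows of the table.
rows = [(n, mu(2 ** n), mu(2 ** n) / mu_tilde(2 ** n), delta(n)) for n in range(1, 6)]
```

The integer `bit_length` trick avoids rounding problems when $\mu(2^n)$ is close to a power of two.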
Proof of Theorem 4.1

If n = 1 there is at least one output port, so $A \ge \rho$, and the result holds. Hence, suppose that $n \ge 2$.

Consider the state of the computation just before the last input bit(s) are accepted. Let m be the number of input bits still to be accepted, so $1 \le m \le 2n$.

It is easy to show that there are some inputs a and b such that the output bits $p_{2n}, \ldots, p_n$ are not determined by the $2n-m$ input bits already accepted. Thus, by assumption A8, at most $n-1$ bits $(p_{n-1}, \ldots, p_1)$ have been output.

Suppose that s bits of information are stored in R. Then we must have, by assumption A7,

$\mu(2^n) \le 2^{m+(n-1)+s}$,

or the circuit could not produce all $\mu(2^n)$ possible outputs, and would fail for certain inputs. Thus

$m + s \ge \lceil \lg \mu(2^n) + 1 - n \rceil = n\delta(n)$,

and, from Lemma 4.3,

(4.6)  $m + s \ge 5n/6$.

By assumption A6,

(4.7)  $A \ge \beta s$.


From Theorem 4.1,

(5.4)  $(A/A_0)^{1-\alpha} \ge n^{1-\alpha}$.

Multiplying (5.3) and (5.4) gives the result. □
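The multiplication step can be written out as follows. This is a sketch under an assumption: (5.3) is taken to be Theorem 3.1 raised to the power $\alpha$, with the constants normalised so that Theorem 3.1 reads $(A/A_0)(T/T_0)^2 \ge n^2$. That form is inferred from (5.4) and from the exponent $1+\alpha$ in the result, not quoted from the text.

```latex
% Assumed form of (5.3): Theorem 3.1 raised to the power \alpha.
\left(\frac{A}{A_0}\right)^{\alpha}\left(\frac{T}{T_0}\right)^{2\alpha} \ge n^{2\alpha}
% (5.4): Theorem 4.1 raised to the power 1-\alpha.
\qquad
\left(\frac{A}{A_0}\right)^{1-\alpha} \ge n^{1-\alpha}
% Product of the two inequalities (both sides are positive, 0 \le \alpha \le 1).
\quad\Longrightarrow\quad
\frac{A}{A_0}\left(\frac{T}{T_0}\right)^{2\alpha} \ge n^{2\alpha + (1-\alpha)} = n^{1+\alpha}.
```

Both exponents $\alpha$ and $1-\alpha$ are nonnegative only for $0 \le \alpha \le 1$, which is why the bound is stated on that interval.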

The following Corollary of Theorem 5.1 seems worth stating separately, for AT is often used as a complexity measure (see, e.g., Mead and Rem [79]).


Note that $h = \beta\rho/(\beta+\rho)$ satisfies

$\tfrac{1}{2}\min(\beta,\rho) \le h < \min(\beta,\rho)$.
2. By the method of Section 6, we can perform binary multiplication with $A = O(n \lg^2 n)$, $T = O(n^{1/2} \lg^2 n)$, and

(5.5)  $AT^{2\alpha} = O(n^{1+\alpha} \lg^{2+4\alpha} n)$.

Thus, the exponent $1+\alpha$ in Theorem 5.1 is the best possible.


3. By a straightforward method we can achieve $A = O(n^2 \lg n)$, $T = O(\lg n)$, so

$AT^{2\alpha} = O(n^2 \lg^{1+2\alpha} n)$.

Thus, Theorem 5.1 cannot be extended for $\alpha > 1$.


4. In Brent and Kung [79] we show that, for n-bit binary addition,

$AT^{2\alpha} = O(n^{2\alpha})$ if $0 \le \alpha \le 1/2$,
$AT^{2\alpha} = O(n \lg^{1+2\alpha} n)$ if $\alpha > 1/2$.

Thus, binary addition is easier than binary multiplication, for all complexity measures $AT^{2\alpha}$, $\alpha \ge 0$. (This holds for $\alpha > 1$ because $AT^{2\alpha} \ge AT^2 \ge K_1 n^2$.)
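The exponent arithmetic behind these remarks can be sketched as simple bookkeeping. The code assumes each of A and T is given as a pair (power of n, power of lg n), which is a representation chosen here for illustration; `at_measure` and the three design tuples are hypothetical names, and the design figures are those read from the remarks above and from Section 6.

```python
from fractions import Fraction


def at_measure(area, time, alpha):
    """Exponents of n and lg n in A * T^(2*alpha).

    area and time are pairs (power of n, power of lg n) describing
    big-O bounds; the result is a pair of the same shape.
    """
    (a_n, a_lg), (t_n, t_lg) = area, time
    two_alpha = 2 * Fraction(alpha)
    return (a_n + two_alpha * t_n, a_lg + two_alpha * t_lg)


# Designs mentioned in the text, as (area, time) pairs of exponent pairs.
SECTION_6 = ((1, 2), (Fraction(1, 2), 2))   # A = O(n lg^2 n), T = O(n^(1/2) lg^2 n)
STRAIGHTFORWARD = ((2, 1), (0, 1))          # A = O(n^2 lg n), T = O(lg n)
SIMPLE = ((1, 0), (1, 0))                   # A = O(n),        T = O(n)

half = Fraction(1, 2)
section_6_at_half = at_measure(*SECTION_6, half)  # exponents (1 + alpha, 2 + 4*alpha)
```

With $\alpha = 1/2$ the Section 6 design gives $n^{3/2} \lg^4 n$, matching $n^{1+\alpha} \lg^{2+4\alpha} n$, while the simple design gives $n^{1+2\alpha} = n^2$.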

6. An Upper Bound on $AT^{2\alpha}$

It is easy to design practical n-bit multipliers with area $A = O(n)$ and time $T = O(n)$, so

(6.1)  $AT^{2\alpha} = O(n^{1+2\alpha})$.

In this section we sketch the design of a multiplier with $A = O(n \lg^2 n)$ and $T = O(n^{1/2} \lg^2 n)$, giving

(6.2)  $AT^{2\alpha} = O(n^{1+\alpha} \lg^{2+4\alpha} n)$,
which is asymptotically better than (6.1). The design is not practical, but it is theoretically interesting because it shows that the exponent $1+\alpha$ in Theorem 5.1 is sharp. We do not know if there is any practical design having $AT^{2\alpha} = o(n^{1+2\alpha})$. Straightforward implementations of "fast" serial algorithms, e.g. the Schönhage-Strassen algorithm (Schönhage and Strassen [71]), or the "3-2 reduction" algorithm (Ofman [62]) seem to require area at least order $n^2$.


In the remainder of this section we assume:

1. $n = k^2$ is a perfect square, and

2. $a_j = b_j = 0$ if $j > n/2$.

(If not, n may be increased sufficiently without affecting the asymptotic results.) Let p be the smallest prime of the form $nq+1$, $q \ge 1$, and $F_p$ the finite field of integers mod p. We assume that $\lg p = O(\lg n)$, which is certainly true in practice, as $q \le 84$ for all $n \le 10000$. (If $\lg p \ne O(\lg n)$ we replace $F_p$ by the complex field and work to sufficient accuracy to get the required results $c_j$ at Step 5 below, or use other methods described in Adleman et al [78], Borodin and Munro [75], and Aho et al [74].) Let u be an n-th root of unity in $F_p$, and $w = u^k$ (so w is a k-th root of unity). Note that in any circuit n is fixed, so we are not concerned with the complexity of finding p, u etc: they will be encoded into the circuit.
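Although p and u are fixed constants of the circuit, it may help to see how they could be found for a given n. The sketch below uses plain trial division and a search over candidate generators; these are illustrative choices made here, not the method of the text. It also assumes that u must be a primitive n-th root of unity (order exactly n), which the transform needs. All function names are hypothetical.

```python
def is_prime(m):
    """Trial-division primality test, adequate for the small p used here."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def smallest_prime_nq_plus_1(n):
    """Return (p, q) with p = n*q + 1 the smallest such prime, q >= 1."""
    q = 1
    while not is_prime(n * q + 1):
        q += 1
    return n * q + 1, q


def prime_factors(m):
    """Set of distinct prime factors of m."""
    factors, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def primitive_root_of_unity(n, p):
    """Find u in F_p of order exactly n (requires n >= 2 and n | p - 1)."""
    for g in range(2, p):
        u = pow(g, (p - 1) // n, p)      # u^n = g^(p-1) = 1 by Fermat
        if all(pow(u, n // r, p) != 1 for r in prime_factors(n)):
            return u
    raise ValueError("no primitive n-th root of unity found")


n, k = 16, 4
p, q = smallest_prime_nq_plus_1(n)       # p = 17, q = 1
u = primitive_root_of_unity(n, p)
w = pow(u, k, p)                         # w = u^k has order k
```

For n = 16 this gives p = 17, and w has order k = 4 because u has order $n = k^2$.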


In Steps 1-5 below all arithmetic is done in $F_p$. In Steps 1-3 we compute the Fourier transform $\underline{a}$ of $(a_1, \ldots, a_n)$ and $\underline{b}$ of $(b_1, \ldots, b_n)$ over $F_p$.

Step 2

Compute $\mathcal{A}'' = \mathcal{A}' \circ U$ and $\mathcal{B}'' = \mathcal{B}' \circ U$, where $\circ$ denotes componentwise multiplication.


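Step 3

Compute $\mathcal{A}''' = \mathcal{A}''W$ and $\mathcal{B}''' = \mathcal{B}''W$ using the same method as for Step 1. It may be shown that $\mathcal{A}'''$ and $\mathcal{B}'''$ contain the Fourier transforms of $(a_1, \ldots, a_n)$ and $(b_1, \ldots, b_n)$, in fact

$\mathcal{A}'''_{ij} = \underline{a}_{(j-1)k+i}$

and

$\mathcal{B}'''_{ij} = \underline{b}_{(j-1)k+i}$

for $1 \le i, j \le k$.

Step 4

Compute $\mathcal{C}''' = \mathcal{A}''' \circ \mathcal{B}'''$.

Step 5

Compute $\mathcal{C} = W^{-1}(U' \circ (\mathcal{C}''' W^{-1}))$ as in Steps 1-3. Here $U'_{ij} = u^{-(i-1)(j-1)}$. $\mathcal{C}$ represents the inverse Fourier transform of $\mathcal{C}'''$. If

$\mathcal{C}_{ij} = c_{(i-1)k+j}$

then, by the convolution theorem and our assumption 2 above,

$c_j = a_1 b_j + a_2 b_{j-1} + \cdots + a_j b_1$  for $1 \le j \le n$.

Thus,

$\sum_{i=1}^{2n} p_i 2^i = \sum_{i=1}^{n} c_i 2^i$,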


and the problem of computing $p_{2n}, \ldots, p_1$ has been reduced to the problem of summing O(lg n) numbers of at most 2n bits (since the $c_j$ have O(lg n) bits). Hence, the final step in the computation is:

Step 6

Compute $p_{2n}, \ldots, p_1$ from the $c_j$.

This may be done by O(lg n) additions, each requiring area O(lg n) and time O(lg n): see Brent and Kung [79].
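Steps 1-6 can be sketched in software as follows. Several details are assumptions made for this illustration rather than statements from the text: Step 1 is taken to be a left multiplication by W, the input matrix is filled row by row, $W_{ij} = w^{(i-1)(j-1)}$ and $U_{ij} = u^{(i-1)(j-1)}$ (written 0-based in the code), and Step 6 is replaced by ordinary integer summation. The helper names are hypothetical, and the code models only the arithmetic, not the chip layout.

```python
def is_prime(m):
    return m >= 2 and all(m % d for d in range(2, int(m ** 0.5) + 1))


def prime_factors(m):
    return {d for d in range(2, m + 1) if m % d == 0 and is_prime(d)}


def field_constants(n):
    """Smallest prime p = n*q + 1 and a primitive n-th root of unity u mod p."""
    p = n + 1
    while not is_prime(p):
        p += n
    for g in range(2, p):
        u = pow(g, (p - 1) // n, p)
        if all(pow(u, n // r, p) != 1 for r in prime_factors(n)):
            return p, u
    raise ValueError("no primitive root of unity found")


def matmul(X, Y, p):
    k = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) % p for j in range(k)]
            for i in range(k)]


def hadamard(X, Y, p):
    """Componentwise product of two matrices mod p."""
    return [[(x * y) % p for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def multiply(a, b, k):
    """Multiply a and b, both below 2^(n//2) with n = k*k and k >= 2."""
    n = k * k
    assert k >= 2 and 0 <= a < 2 ** (n // 2) and 0 <= b < 2 ** (n // 2)
    p, u = field_constants(n)
    w, u_inv, k_inv = pow(u, k, p), pow(u, p - 2, p), pow(k, p - 2, p)
    w_inv = pow(w, p - 2, p)
    idx = range(k)
    W = [[pow(w, i * j, p) for j in idx] for i in idx]
    U = [[pow(u, i * j, p) for j in idx] for i in idx]
    W_inv = [[k_inv * pow(w_inv, i * j, p) % p for j in idx] for i in idx]
    U_inv = [[pow(u_inv, i * j, p) for j in idx] for i in idx]

    def transform(x):
        bits = [[(x >> (i * k + j)) & 1 for j in idx] for i in idx]
        step1 = matmul(W, bits, p)        # Step 1 (assumed form)
        step2 = hadamard(step1, U, p)     # Step 2
        return matmul(step2, W, p)        # Step 3

    C3 = hadamard(transform(a), transform(b), p)                     # Step 4
    C = matmul(W_inv, hadamard(U_inv, matmul(C3, W_inv, p), p), p)   # Step 5
    # Step 6: C[i][j] is the convolution coefficient of weight 2^(i*k + j).
    return sum(C[i][j] << (i * k + j) for i in idx for j in idx)
```

For example, `multiply(13, 11, 4)` works with n = 16 and p = 17. The restriction to inputs below $2^{n/2}$ plays the role of assumption 2: it keeps the cyclic convolution from wrapping around and keeps every coefficient below p.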


This completes our outline of the multiplier with area $A = O(n \lg^2 n)$, time $T = O(n^{1/2} \lg^2 n)$, and $AT^{2\alpha} = O(n^{1+\alpha} \lg^{2+4\alpha} n)$. The exponent $2+4\alpha$ of lg n can certainly be reduced, but we do not know what its minimal value is.

7.  Some  Open  Problems 

Our  results  suggest  several  interesting  problems: 

1. Can the constants $A_0$ and $T_0$ be increased?

2. How far can the gap $O(\lg^{2+4\alpha} n)$ between the upper and lower bounds be reduced?

3. Is there a practical design with $AT^{2\alpha} = O(n^{1+\alpha+\varepsilon})$, for all $\varepsilon > 0$?

4. Can any of our assumptions A1 to A8 be relaxed?

5. Can the restriction to binary representation be removed?

6. For binary division it is easy to deduce a lower bound of the same form as (5.1), using the method of Brent [76]; and an upper bound $AT^{2\alpha} = O(n^{1+\alpha} \lg^{2+4\alpha} n)$, using Newton's method. Thompson [79] has proved a lower bound like (5.1) for computation of the discrete Fourier transform, using a model similar (though not identical) to ours. Can similar upper and/or lower bounds be proved for other computations?
Appendix: Summary of Notation


$a$  input to multiplier, $0 \le a < 2^n$, $a = a_n \ldots a_1$ in binary notation.
$a_i$  i-th least significant bit of a, $1 \le i \le n$.
$\underline{a}_i$  $(\underline{a}_1, \ldots, \underline{a}_n)$ is Fourier transform of $(a_1, \ldots, a_n)$ over $F_p$.
$A$  area of region R. See assumption A1.
$\mathcal{A}$  k by k matrix $(\mathcal{A}_{ij})$. See Section 6. Similarly for $\mathcal{A}'$, $\mathcal{A}''$ and $\mathcal{A}'''$.
$b$  input to multiplier, $0 \le b < 2^n$, $b = b_n \ldots b_1$.
$b_i$  i-th least significant bit of b, $1 \le i \le n$.
$\underline{b}_i$  $(\underline{b}_1, \ldots, \underline{b}_n)$ is Fourier transform of $(b_1, \ldots, b_n)$ over $F_p$.
$\mathcal{B}$  k by k matrix. See Section 6. Similarly for $\mathcal{B}'$, $\mathcal{B}''$ and $\mathcal{B}'''$.
$C$  chord (almost) perpendicular to D, or length of chord. See Section 3.
$\mathcal{C}$  k by k matrix. See Section 6. Similarly for $\mathcal{C}'''$.
$D$  diameter of R, or length of diameter. See Section 3.
$F_p$  finite field of p elements. See Section 6.
$h$  $\beta\rho/(\beta+\rho)$.
$i$  nonnegative integer.
$j$  nonnegative integer.
$k$  $n^{1/2}$. See Section 6.
$K_i$  constant, $1 \le i \le 3$. See (3.2), (3.5) and (5.5).
lg  $\log_2$. $\lg^x n$ denotes $(\log_2(n))^x$.
ln  $\log_e$.
$L$  perimeter of R, or length of the perimeter.
$m$  number of input bits still to be accepted. See proof of Thm. 4.1.
$M$  maximum number of elements of S sharing any output port. See Lemma 3.3.
$n$  number of bits in inputs a and b.
$N$  positive integer.
$p$  prime number ($p > 1$). ($p = nq+1$ in Section 6.)
$p_i$  i-th least significant output bit, output = $p_{2n} \ldots p_1$ in binary notation.
[…]  $\{ij \mid 0 \le i < N,\ 0 \le j < N\}$. See Section 4.
[…]  $\{p \mid p \text{ prime},\ 2 \le p \le N\}$.
$q$  positive integer.
$r$  M/n. See Lemma 3.3. (Free variable in Lemma 3.2.)
$R$  region in which computation is performed. See assumption A1.
$s$  number of bits of information stored in R. See proof of Thm. 4.1.
$S$  $(p_{2n-1}, \ldots, p_n)$. See proof of Lemma 3.3.
$S_i$  subsequence of S, $1 \le i \le 3$.
$T$  time required for computation.
$T_0$  constant defined by equation (5.2).
$u$  n-th root of unity in $F_p$.
$U$  k by k matrix. See Section 6. Similarly for $U'$.
$w$  k-th root of unity in $F_p$. $w = u^k$.
$W$  k by k matrix. See Section 6.
$\alpha$  free variable, $0 \le \alpha \le 1$.
$\beta$  minimum area required to store one bit. See assumption A6.
$\delta(n)$  function defined by equation (4.3).
$\lambda$  minimum width of a wire. See assumption A2.
$\tilde\mu(N)$  $N^2/(0.71 + \lg\lg N)$, approximation to $\mu(N)$.
$\nu$  maximum number of wires which can overlap at any point. See assumption A3.
$\rho$  minimum area for an I/O port. See assumption A5.
$\sigma(N)$  $\sum p$, summed over the primes p with $2 \le p \le N$.
$\tau$  minimum time for propagation of one bit along a wire. See assumption A4.


